RECURSIVE BIAS ESTIMATION AND L2 BOOSTING

By Pierre-André Cornillon, Nicolas Hengartner and Eric Matzner-Løber

Montpellier SupAgro, University Rennes 2 and Los Alamos National Laboratory

This paper presents a general iterative bias correction procedure for
regression smoothers. This bias reduction scheme is shown to correspond
operationally to the L2 Boosting algorithm and provides a new statistical
interpretation for L2 Boosting. We analyze the behavior of the Boosting
algorithm applied to common smoothers $S$, which we show depends on the
spectrum of $I - S$. We present examples of common smoothers for which
Boosting generates a divergent sequence. The statistical interpretation
suggests combining the algorithm with an appropriate stopping rule for the
iterative procedure. Finally, we illustrate the practical finite sample
performance of the iterative smoother via a simulation study.

1. Introduction. Regression is a fundamental data analysis tool for
uncovering functional relationships between pairs of observations
$(X_i, Y_i)$, $i = 1, \dots, n$. The traditional approach specifies a
parametric family of regression functions to describe the conditional
expectation of the dependent variable $Y$ given the independent variables
$X \in \mathbb{R}^d$, and estimates the free parameters by minimizing the
squared error between the predicted values and the data. An alternative
approach is to assume that the regression function varies smoothly in the
independent variable $x$ and to estimate locally the conditional expectation
of $Y$ given $X$. This results in nonparametric regression estimators (e.g.
Fan and Gijbels [13], Hastie and Tibshirani [19], Simonoff). The vector of
predicted values $\hat Y_i$ at the observed covariates $X_i$ from a
nonparametric regression is called a regression smoother, or simply a
smoother, because the predicted values $\hat Y_i$ are less variable than the
original observations $Y_i$.

Over the past thirty years, numerous smoothers have been proposed:
running-mean smoother, running-line smoother, bin smoother, kernel based
smoother (Nadaraya, Watson), spline regression smoother, smoothing splines
smoother (Wahba [33], Whittaker), locally weighted running-line smoother
(Cleveland), just to mention a few. We refer to Buja et al., Eubank [12],
Fan and Gijbels [13] and Hastie and Tibshirani [19] for more in-depth
treatments of regression smoothers.

An important property of smoothers is that they do not require a rigid
(parametric) specification of the regression function. That is, we model the
pairs $(X_i, Y_i)$ as

(1.1)    $Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n,$

where $m(\cdot)$ is an unknown smooth function. The disturbances
$\varepsilon_i$ are independent random variables with mean zero and variance
$\sigma^2$ that are independent of the covariates $X_i$, $i = 1, \dots, n$.
To help our discussion on smoothers, we rewrite Equation (1.1) compactly in
vector form by setting $Y = (Y_1, \dots, Y_n)'$,
$m = (m(X_1), \dots, m(X_n))'$ and
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$, to get

(1.2)    $Y = m + \varepsilon.$

Finally we write $\hat m = \hat Y = (\hat Y_1, \dots, \hat Y_n)'$, the vector
of fitted values from the regression smoother at the observations.
Operationally, linear smoothers can be written as

         $\hat m = S_\lambda Y,$

where $S_\lambda$ is an $n \times n$ smoothing matrix. While in general the
smoothing matrix will not be a projection, it is usually a contraction (Buja
et al.). That is, $\|S_\lambda Y\| \le \|Y\|$.

Smoothing matrices $S_\lambda$ typically depend on a tuning parameter, which
we denote by $\lambda$, that governs the tradeoff between the smoothness of
the estimate and the goodness-of-fit of the smoother to the data. We
parameterize the smoothing matrix such that large values of $\lambda$ produce
very smooth curves while small values of $\lambda$ produce a more wiggly
curve that tends to interpolate the data. The parameter $\lambda$ is the
bandwidth for a kernel smoother, the span size for a running-mean smoother or
a bin smoother, and the penalty factor for a spline smoother.

Much has been written on how to select an appropriate smoothing parameter,
see for example Simonoff. Ideally, we want to choose the smoothing parameter
$\lambda$ to minimize the expected squared prediction error. But without
explicit knowledge of the underlying regression function, the prediction
error cannot be computed. Instead, one minimizes estimates of the prediction
error using the Stein Unbiased Risk Estimate or Cross-Validation (Li [27]).

This paper takes a different approach. Instead of selecting the tuning
parameter $\lambda$, we fix it to some reasonably large value, in a way that
ensures that the resulting smoother oversmooths the data, that is, the
resulting smoother will have a relatively small variance but a substantial
bias. Observe that the conditional expectation of $-R = -(Y - \hat Y)$ given
$X$ is the bias of the smoother. This provides us with the opportunity of
estimating the bias by smoothing the residuals $R$, thereby enabling us to
bias correct the initial smoother by subtracting from it the estimated bias.
The idea of estimating the bias from residuals to correct a pilot estimator
of a regression function goes back to the concept of twicing introduced by
Tukey [35] to estimate bias from model misspecification in multivariate
regression. Obviously, one can iteratively repeat the bias correction step
until the increase in variance from the bias correction outweighs the
reduction in bias, leading to an iterative bias correction.



Another iterative function estimation method, seemingly unrelated to bias
reduction, is Boosting. Boosting was introduced as a machine learning
algorithm for combining multiple weak learners by averaging their weighted
predictions (Freund [15], Schapire [31]). The good performance of the
Boosting algorithm on a variety of datasets stimulated statisticians to
understand it from a statistical point of view. In his seminal paper, Breiman
shows how Boosting can be interpreted as a gradient descent method. This view
of Boosting was reinforced by Friedman [16]. Adaboost, a popular variant of
the Boosting algorithm, can be understood as a method for fitting an additive
model (Friedman et al.), and recently Efron et al. [11] made a connection
between L2 Boosting and Lasso for linear models.

But connections between iterative bias reduction and Boosting can be made.
In the context of nonparametric density estimation, Di Marzio and Taylor have
shown that one iteration of the Boosting algorithm reduces the bias of the
initial estimator in a manner similar to the multiplicative bias reduction
methods (Hengartner and Matzner-Løber, Hjort and Glad, Jones et al. [25]).
In the follow-up paper (Di Marzio and Taylor), they extend their results to
the nonparametric regression setting and show that one step of the Boosting
algorithm applied to an oversmoothed pilot estimate effects a bias reduction.
As expected, the decrease in the bias comes at the cost of an increase in the
variance of the corrected smoother.

In Section 2, we show that in the context of regression, such iterative bias
reduction schemes obtained by correcting an estimator by smoothers of the
residuals correspond operationally to the L2 Boosting algorithm. This
provides a novel statistical interpretation of L2 Boosting. This new
interpretation helps explain why, as the number of iterations increases, the
estimator eventually deteriorates. Indeed, by iteratively reducing the bias,
one eventually adds more variability than one removes bias.

In Section 3, we discuss the behavior of the L2 Boosting of many commonly
used smoothers: smoothing splines, Nadaraya-Watson kernel and K-nearest
neighbor smoothers. Unlike the good behavior of the L2 boosted smoothing
splines discussed in Bühlmann and Yu, we show that Boosting K-nearest
neighbor smoothers and kernel smoothers that are not positive definite
produces a sequence of smoothers that behaves erratically after a small
number of iterations, and eventually diverges. The reason for the failure of
the L2 Boosting algorithm, when applied to these smoothers, is that the bias
is overestimated. As a result, the Boosting algorithm over-corrects the bias
and produces a divergent smoother sequence. Section 4 discusses modifications
to the original smoother to ensure good behavior of the sequence of boosted
smoothers.

To control both the over-fitting and over-correction problems, one needs to
stop the L2 Boosting algorithm in a timely manner. Our interpretation of L2
Boosting as an iterative bias correction scheme leads us to propose in
Section 5 several data driven stopping rules: Akaike Information Criterion
(AIC), a modified AIC, Generalized Cross Validation (GCV), leave-one-out and
L-fold Cross Validation, and prediction error estimation using data
splitting. Using either the asymptotic results of Li [27] or the finite
sample oracle inequality of Hengartner et al. [21], we see that the stopped
boosted smoother has desirable statistical properties. We use either of these
theorems to conclude that the desirable properties of the boosted smoother do
not depend on the initial pilot smoother, provided that the pilot oversmooths
the data. This conclusion is reaffirmed by the simulation study we present in
Section 6. To implement these data driven stopping rules, we need to
calculate predictions of the smoother for any desired value of the
covariates, and not only at the observations. We show in Section 5 how to
extend linear smoothers to give predictions at any desired point.

The simulations in Section 6 show that combining a GCV based stopping rule
with the L2 Boosting algorithm seems to work well. It stops early when the
Boosting algorithm misbehaves, and otherwise takes advantage of the bias
reduction. Our simulation compares optimum smoothers and optimum iterative
bias corrected smoothers (using generalized cross validation) for general
smoothers without knowledge of the underlying regression function. We
conclude that the optimal iterative bias corrected smoother outperforms the
optimal smoother.

Finally, the proofs are gathered in the Appendix.

2. Recursive bias estimation. In this section, we define a class of
iteratively bias corrected linear smoothers and highlight some of their
properties.

2.1. Bias Corrected Linear Smoothers. For ease of exposition, we shall
consider the univariate nonparametric regression model in vector form (1.2)
from Section 1

         $Y = m + \varepsilon,$

where the errors $\varepsilon$ are independent, have mean zero and constant
variance $\sigma^2$, and are independent of the covariates
$X = (X_1, \dots, X_n)$, $X_i \in \mathbb{R}$. Extensions to multivariate
smoothers are straightforward and we refer to Buja et al. for example.

Linear smoothers can be written as

(2.1)    $\hat m_1 = SY,$

where $S$ is an $n \times n$ smoothing matrix. Typical smoothing matrices are
contractions, so that $\|SY\| \le \|Y\|$, and as a result the associated
smoother $SY$ is called a shrinkage smoother (see for example Buja et al.).
Let $I$ be the $n \times n$ identity matrix.

The linear smoother (2.1) has bias

(2.2)    $B(\hat m_1) = E[\hat m_1 | X] - m = (S - I)m$

and variance

         $V(\hat m_1 | X) = S S' \sigma^2,$

respectively.

A natural question is "how can one estimate the bias?" To answer this
question, observe that the residuals $R_1 = Y - \hat m_1 = (I - S)Y$ have
expected value $E[R_1 | X] = m - E[\hat m_1 | X] = (I - S)m = -B(\hat m_1)$.
This suggests estimating the bias by smoothing the negative residuals

(2.3)    $\hat b_1 := -S R_1 = -S(I - S)Y.$

This bias estimator is zero whenever the smoothing matrix $S$ is a
projection, as is the case for linear regression, bin smoothers and
regression splines. However, since most common smoothers are not projections,
we have an opportunity to extract further signal from the residuals and
possibly improve upon the initial estimator.

Note that a smoothing matrix other than $S$ can be used to estimate the bias
in (2.3), but as we shall see, in many examples, using $S$ works very well,
and leads to an attractive interpretation of Equation (2.3). Indeed, since
the matrices $S$ and $I - S$ commute, we can express the estimated bias as

         $\hat b_1 = -S(I - S)Y = -(I - S)SY = (S - I)\hat m_1.$

We recognize the latter as the right-hand side of (2.2) with the smoother
$\hat m_1$ replacing the unknown vector $m$. This says that $\hat b_1$ is a
plug-in estimate for the bias $B(\hat m_1)$.

Subtracting the estimated bias from the initial smoother $\hat m_1$ produces
the twicing estimator

         $\hat m_2 = \hat m_1 - \hat b_1$
         $\phantom{\hat m_2} = (S + S(I - S))Y$
         $\phantom{\hat m_2} = (I - (I - S)^2)Y.$

Since the twiced smoother $\hat m_2$ is also a linear smoother, one can
repeat the above discussion with $\hat m_2$ replacing $\hat m_1$, producing a
thriced linear smoother. We can iterate the bias correction step to
recursively define a family of bias corrected smoothers. Starting with
$\hat m_1 = SY$, construct recursively for $k = 2, 3, \dots$, the sequences
of residuals, estimated bias and bias corrected smoothers

         $R_{k-1} = Y - \hat m_{k-1} = (I - S)^{k-1} Y,$
         $\hat b_k = -S R_{k-1} = -(I - S)^{k-1} S Y,$
(2.4)    $\hat m_k = \hat m_{k-1} - \hat b_k = \hat m_{k-1} + S R_{k-1}.$

We show in the next theorem that the iteratively bias corrected smoother
$\hat m_k$ defined by Equation (2.4) has a nice representation in terms of
the original smoothing matrix $S$.

Theorem 2.1. The $k$-th iterated bias corrected linear smoother $\hat m_k$
(2.4) can be explicitly written as

         $\hat m_k = S[I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1}]Y$
(2.5)    $\phantom{\hat m_k} = [I - (I - S)^k]Y = S_k Y.$
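The identity in Theorem 2.1 can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: the Gaussian Nadaraya-Watson smoothing matrix, the simulated data and the helper names (`gaussian_smoother_matrix`, `iterate_bias_correction`, `closed_form`) are all hypothetical choices. It compares recursion (2.4) with the closed form (2.5) for a few values of $k$.

```python
import numpy as np

def gaussian_smoother_matrix(x, bandwidth):
    """Row-normalised Gaussian kernel weights (a Nadaraya-Watson smoothing matrix)."""
    scaled = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * scaled**2)
    return weights / weights.sum(axis=1, keepdims=True)

def iterate_bias_correction(S, y, k):
    """Recursion (2.4): start from S y, then add the smoothed residuals k - 1 times."""
    fit = S @ y
    for _ in range(k - 1):
        fit = fit + S @ (y - fit)
    return fit

def closed_form(S, y, k):
    """Closed form (2.5): [I - (I - S)^k] y."""
    identity = np.eye(S.shape[0])
    return (identity - np.linalg.matrix_power(identity - S, k)) @ y

rng = np.random.default_rng(0)          # illustrative data, not the paper's
x = np.sort(rng.uniform(size=30))
y = np.sin(5 * np.pi * x) + 0.3 * rng.normal(size=30)
S = gaussian_smoother_matrix(x, bandwidth=0.2)

max_gap = max(
    np.max(np.abs(iterate_bias_correction(S, y, k) - closed_form(S, y, k)))
    for k in (1, 2, 5, 20)
)
print(f"largest discrepancy between (2.4) and (2.5): {max_gap:.2e}")
```

The two computations should agree up to floating point round-off, whatever linear smoother is substituted for `S`.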
Example with a Gaussian kernel smoother. Throughout the next two sections,
we shall use the following example to illustrate the behavior of the Boosting
algorithm applied to various common smoothers. Take the design points to be
50 independently drawn points from a uniform distribution on the unit
interval $[0, 1]$. The true regression function is $m(x) = \sin(5\pi x)$, the
solid line in Figure 1, and the disturbances are mean zero Gaussians with
variance producing a signal to noise ratio of five.

In the next figure, the initial smoother is a kernel smoother with a Gaussian
kernel and a bandwidth equal to 0.2. This pilot smoother heavily oversmooths
the data: Figure 1 shows that the pilot smoother (solid line) is nearly
constant. The iterative bias corrected estimators are plotted in Figure 1 for
values of k, the number of iterations, in {1, 10, 50, 100, 500, 10^, 10^ 10^} 




Fig 1. True function m (thick solid line) and different estimators varying with the number
of iterations k.

Figure 1 shows how each bias correction iteration changes the smoother,
starting from a nearly constant smoother and slowly deforming (going down 
into the valleys and up into the peaks) with increasing number of iterations 
k = 10, k = 50 and k = 100. After 500 iterations, the iterative smoother is 
very close to the true function. However when the number of iterations is 
very large (here k = 10^ and 10^) the iterative smoother deteriorates. 
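The setting of this example can be simulated in a few lines. The sketch below is a hypothetical reconstruction, not the authors' code: the random seed is arbitrary, and "signal to noise ratio of five" is read here as the ratio of the empirical variance of $m$ to the noise variance, which is an assumption. It reports the mean squared distance between the iterated smoother and the true function at the design points for the smaller values of $k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.sort(rng.uniform(size=n))
truth = np.sin(5 * np.pi * x)

# Assumption: "signal to noise ratio of five" is read as var(m) / sigma^2 = 5.
sigma = np.sqrt(truth.var() / 5.0)
y = truth + sigma * rng.normal(size=n)

# Gaussian Nadaraya-Watson pilot smoother with bandwidth 0.2.
scaled = (x[:, None] - x[None, :]) / 0.2
weights = np.exp(-0.5 * scaled**2)
S = weights / weights.sum(axis=1, keepdims=True)

identity = np.eye(n)
mse = {}
for k in (1, 10, 50, 100, 500):
    S_k = identity - np.linalg.matrix_power(identity - S, k)   # closed form (2.5)
    mse[k] = float(np.mean((S_k @ y - truth) ** 2))
    print(f"k = {k:4d}   mean squared error to m = {mse[k]:.4f}")
```

Larger values of $k$ can be added to the loop to look for the deterioration described above; the exact numbers depend on the simulated sample.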

Lemma 2.2. The squared bias and variance of the $k$-th iterated bias
corrected linear smoother $\hat m_k$ (2.4) are

         $B^2(\hat m_k) = m' ((I - S)^k)' (I - S)^k m$

         $V(\hat m_k) = \sigma^2 (I - (I - S)^k) ((I - (I - S)^k))'.$
Remark: Symmetric smoothing matrices $S$ can be decomposed as
$S = P_S \Lambda_S P_S'$, with orthonormal matrix
$P_S = [u_1, u_2, \dots, u_n]$ and diagonal matrix $\Lambda_S$. It follows
that

(2.6)    $\hat m_k = P_S \,\mathrm{diag}(1 - (1 - \lambda_j)^k)\, P_S' Y = \sum_j (1 - (1 - \lambda_j)^k) u_j u_j' Y.$

Applying Lemma 2.2 we get

         $B^2(\hat m_k) = m' P_S (I - \Lambda_S)^{2k} P_S' m$
         $V(\hat m_k) = \sigma^2 P_S (I - (I - \Lambda_S)^k)^2 P_S'.$

Hence if the magnitudes of the eigenvalues of $I - S$ are bounded by one,
each iteration of the bias correction will decrease the bias and increase the
variance. This monotonicity (decreasing bias, increasing variance) with
increasing number of iterations $k$ allows us to consider data driven
selection of the number of bias correction steps that achieves the best
compromise between bias and variance of the smoother.

The preceding remark suggests that the behavior of the iterative bias
corrected smoother $\hat m_k$ is tied to the spectrum of $I - S$, and not of
$S$. The next theorem collects the various convergence results for iterated
bias corrected linear smoothers.
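The bias and variance trade-off in the spectral formulas of the remark can be made concrete. The sketch below is an illustration only: it assumes a hypothetical symmetric smoother $S = (I + 50\, D'D)^{-1}$ built from the second-difference matrix $D$ (so its eigenvalues lie in $(0, 1]$), an illustrative regression function and an illustrative noise variance. It evaluates the squared bias $m' P_S (I - \Lambda_S)^{2k} P_S' m$ and the total variance $\sigma^2 \sum_j (1 - (1 - \lambda_j)^k)^2$ for $k = 1, \dots, 200$.

```python
import numpy as np

n = 40
x = np.linspace(0.0, 1.0, n)
m = np.sin(5 * np.pi * x)      # illustrative regression function
sigma2 = 0.1                   # illustrative noise variance

# Hypothetical symmetric smoother: S = (I + penalty * D'D)^{-1}, with D the
# second-difference matrix. It is symmetric with eigenvalues in (0, 1].
D = np.diff(np.eye(n), n=2, axis=0)
S = np.linalg.inv(np.eye(n) + 50.0 * D.T @ D)
S = (S + S.T) / 2              # remove round-off asymmetry before eigh

lam, P = np.linalg.eigh(S)     # S = P diag(lam) P'
coef = P.T @ m                 # coordinates of m in the eigenbasis

ks = np.arange(1, 201)
# One row per k: (1 - lambda_j)^k for every eigenvalue lambda_j.
shrink = (1.0 - lam)[None, :] ** ks[:, None]
squared_bias = (shrink**2 * coef[None, :] ** 2).sum(axis=1)
total_variance = sigma2 * ((1.0 - shrink) ** 2).sum(axis=1)

best_k = int(ks[np.argmin(squared_bias + total_variance)])
print("k minimising squared bias + total variance:", best_k)
```

In this sketch the squared bias is non-increasing and the total variance non-decreasing in $k$, which is the monotonicity the remark relies on. The minimiser `best_k` requires the unknown $m$ and is therefore an oracle quantity, not a usable stopping rule.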

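Theorem 2.3. Suppose that the singular values $\lambda_j = \lambda_j(I - S)$
of $I - S$ satisfy

(2.7)    $-1 < \lambda_j < 1 \quad \text{for } j = 1, \dots, n.$

Then we have that

         $\|\hat b_k\| < \|\hat b_{k-1}\|$ and $\lim_{k \to \infty} \hat b_k = 0,$

         $\|R_k\| < \|R_{k-1}\|$ and $\lim_{k \to \infty} R_k = 0,$

         $\lim_{k \to \infty} \hat m_k = Y$ and $\lim_{k \to \infty} E[\|\hat m_k - m\|^2] = n\sigma^2.$

Conversely, if $I - S$ has a singular value $|\lambda_j| > 1$, then

         $\lim_{k \to \infty} \|\hat b_k\| = \lim_{k \to \infty} \|R_k\| = \lim_{k \to \infty} \|\hat m_k\| = \infty.$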

Remark 1: This theorem shows that iterating the Boosting algorithm to reach
the limit of the sequence of boosted smoothers, namely the raw data $Y$, is
not desirable. However, since each iteration decreases the bias and increases
the variance, a suitably stopped Boosting estimator is likely to improve upon
the initial smoother.

Remark 2: When $|\lambda_j(I - S)| > 1$, the iterative bias correction fails.
The reason is that $\hat b_k$ overestimates the true bias $b_k$, and hence
Boosting repeatedly overcorrects the bias of the smoothers, which results in
a divergent sequence of smoothers. Divergence of the sequence of boosted
smoothers can be detected numerically, making it possible to avoid this bad
behavior by combining the iterative bias correction procedure with a suitable
stopping rule.

Remark 3: The assumption that for all $j$, the singular values satisfy
$-1 < \lambda_j(I - S) < 1$ implies that $I - S$ is a contraction, so that
$\|(I - S)Y\| < \|Y\|$. This condition does not imply that the smoother $S$
itself is a shrinkage smoother as defined by Buja et al. Conversely, not all
shrinkage estimators satisfy the condition (2.7) of the theorem. In Section
3, we will give examples of common shrinkage smoothers for which
$|\lambda_j(I - S)| > 1$, and show numerically that for these shrinkage
smoothers, the iterative bias correction scheme will fail.



2.2. L2 Boosting for regression. Boosting is one of the most successful and
practical methods that arose 15 years ago from the machine learning community
(Freund [15], Schapire [31]). In light of Friedman [16], the Boosting
algorithm has been interpreted as a functional gradient descent technique.
Let us summarize the L2 Boost algorithm described in Bühlmann and Yu.

Step 0: Set $k = 1$. Given the data $\{(X_i, Y_i), i = 1, \dots, n\}$,
calculate a pilot regression smoother

         $\hat F_1(x) = h(x; \hat\theta_{X,Y}),$

by least squares fitting of the parameter, that is,

         $\hat\theta_{X,Y} = \arg\min_{\theta} \sum_{i=1}^{n} (Y_i - h(X_i; \theta))^2.$

Step 1: With a current smoother $\hat F_k$, compute the residuals
$U_i = Y_i - \hat F_k(X_i)$ and fit the real-valued learner to the current
residuals by least squares. The fit is denoted by $\hat f_{k+1}(\cdot)$.
Update

(2.8)    $\hat F_{k+1}(\cdot) = \hat F_k(\cdot) + \hat f_{k+1}(\cdot).$

Step 2: Increase the iteration index $k$ by one and repeat Step 1.
Lemma 2.4 (Bühlmann and Yu, 2003). The smoothing matrix associated with the
$k$-th Boosting iterate of a linear smoother with smoothing matrix $S$ is
given by

         $\hat F_k = (I - (I - S)^k)Y = B_k Y.$



Viewing Boosting as a greedy gradient descent method, the update formula
(2.8) is often modified to include a convergence factor $\hat\mu_k$, as in
Friedman [16], to become

         $\hat F_{k+1}(\cdot) = \hat F_k(\cdot) + \hat\mu_{k+1} \hat f_{k+1}(\cdot),$

where $\hat\mu_{k+1}$ is the best step toward the best direction
$\hat f_{k+1}(\cdot)$.

This general formulation allows a great deal of flexibility, both in
selecting the type of smoother used in each iteration of the Boosting
algorithm, and in the selection of the convergence factor. For example, we
may start with a running mean pilot smoother, and use a smoothing spline to
estimate the bias in the first Boosting iteration and a nearest neighbor
smoother to estimate the bias in the second iteration. However, in practice,
one typically uses the same smoother for all iterations and fixes the
convergence factor $\hat\mu_k = \mu \in (0, 1)$. That is, the sequence of
smoothers resulting from the Boosting algorithm is given by

(2.9)    $\hat F_k = (I - (I - \mu S)^k)Y = B_k Y.$

We shall discuss in detail in Section 4 the impact of this convergence factor
and other modifications to the Boosting algorithm to ensure good behavior of
the sequence of boosted smoothers.
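The L2 Boost steps with a fixed convergence factor can be sketched as follows. This is a minimal illustration, not the implementation of Bühlmann and Yu: the function `l2_boost`, the Gaussian kernel learner, the simulated data and the values $\mu = 0.5$, $k = 15$ are all hypothetical. For a linear learner the loop is compared with the closed form (2.9); note that matching (2.9) requires the pilot fit to be scaled by $\mu$ as well.

```python
import numpy as np

def l2_boost(learner, y, n_iter, step=1.0):
    """L2 Boosting for a learner that maps a response vector to fitted values.

    The pilot fit is step * learner(y); each later iteration adds
    step * learner(residuals), i.e. update (2.8) with a fixed convergence factor.
    """
    fit = step * learner(y)
    for _ in range(n_iter - 1):
        fit = fit + step * learner(y - fit)
    return fit

rng = np.random.default_rng(2)          # illustrative data
n = 25
x = np.sort(rng.uniform(size=n))
y = np.cos(3 * x) + 0.2 * rng.normal(size=n)

# Hypothetical linear learner: Gaussian kernel smoother with bandwidth 0.3.
scaled = (x[:, None] - x[None, :]) / 0.3
weights = np.exp(-0.5 * scaled**2)
S = weights / weights.sum(axis=1, keepdims=True)

mu, k = 0.5, 15
boosted = l2_boost(lambda r: S @ r, y, n_iter=k, step=mu)

# Closed form (2.9): [I - (I - mu S)^k] y.
identity = np.eye(n)
closed = (identity - np.linalg.matrix_power(identity - mu * S, k)) @ y
gap = float(np.max(np.abs(boosted - closed)))
print(f"discrepancy with (2.9): {gap:.2e}")
```

Because `learner` is an arbitrary callable, the same loop accepts a different smoother at each call site, which mirrors the flexibility described above; the closed form only applies when the learner is linear and fixed.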

3. Boosting classical smoothers. This section is devoted to understanding
the behavior of the iterative Boosting scheme using classical smoothers,
which in light of Theorem 2.3 depends on the magnitude of the singular values
of the matrix $I - S$.

We start our discussion by noting that Boosting a projection type smoother
is of no interest because the residuals $(I - S)Y$ are orthogonal to the
smoother $SY$. It follows that the smoothed residuals $S(I - S)Y = 0$, and as
a result, $\hat m_k = \hat m_1$ for all $k$. Hence Boosting a bin smoother or
a regression spline smoother leaves the initial smoother unchanged.
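A quick numerical check of this point, as a sketch under assumptions: the projection used here is the hat matrix of a simple linear regression on simulated data (both chosen for illustration), and one bias correction step is applied to it.

```python
import numpy as np

rng = np.random.default_rng(3)          # illustrative data
n = 20
x = np.sort(rng.uniform(size=n))
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=n)

# Hat matrix of simple linear regression: a symmetric projection.
X = np.column_stack([np.ones(n), x])
S = X @ np.linalg.solve(X.T @ X, X.T)

pilot = S @ y
smoothed_residuals = S @ (y - pilot)    # S (I - S) y, zero for a projection
boosted = pilot + smoothed_residuals    # one bias correction step

max_residual_smooth = float(np.max(np.abs(smoothed_residuals)))
max_change = float(np.max(np.abs(boosted - pilot)))
print(max_residual_smooth, max_change)
```

Both printed quantities should be at the level of floating point round-off, so further iterations would leave the fit unchanged.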

Consider the K-nearest neighbor smoother. Its associated smoothing matrix
is $S_{ij} = 1/K$ if $X_j$ belongs to the $K$ nearest neighbors of $X_i$ and
$S_{ij} = 0$ otherwise. Note that this smoothing matrix is not symmetric.
While this smoother enjoys many desirable properties, it is not well suited
for Boosting because the matrix $I - S$ has singular values larger than one.

Theorem 3.1. In the fixed design or in the uniform design, as soon as the
number of neighbors $K$ is bigger than one and smaller than $n$, at least one
singular value of $I - S$ is bigger than 1.

The proof of the theorem is found in the appendix. A consequence of Theorem
3.1 and Theorem 2.3 is that the Boosting algorithm applied to a K-nearest
neighbor smoother produces a sequence of divergent smoothers, and hence
should not be used in practice.
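This divergence can be reproduced on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: an equally spaced design with $n = 50$, $K = 5$, simulated responses, and a hypothetical helper `knn_smoother_matrix`. It reads the "singular values" of Theorem 2.3 as eigenvalues of $I - S$ (condition (2.7) allows negative values), reports their largest modulus, and tracks the norm of the iterates of recursion (2.4).

```python
import numpy as np

def knn_smoother_matrix(x, n_neighbors):
    """S[i, j] = 1/K if x[j] is among the K nearest points to x[i] (itself included)."""
    n = len(x)
    distances = np.abs(x[:, None] - x[None, :])
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
    S = np.zeros((n, n))
    S[np.arange(n)[:, None], nearest] = 1.0 / n_neighbors
    return S

n, K = 50, 5
x = np.linspace(0.0, 1.0, n)            # fixed, equally spaced design
S = knn_smoother_matrix(x, K)

# Largest modulus among the eigenvalues of I - S.
spectral_radius = float(np.max(np.abs(np.linalg.eigvals(np.eye(n) - S))))
print(f"max |eigenvalue of I - S| = {spectral_radius:.3f}")

rng = np.random.default_rng(4)          # illustrative data
y = np.sin(5 * np.pi * x) + 0.3 * rng.normal(size=n)
fit = S @ y
norms = [float(np.linalg.norm(fit))]
for _ in range(399):                    # iterations 2, ..., 400 of recursion (2.4)
    fit = fit + S @ (y - fit)
    norms.append(float(np.linalg.norm(fit)))
print(f"norm of fit at k = 1: {norms[0]:.2f}, at k = 400: {norms[-1]:.3e}")
```

A largest modulus above one goes together with a rapidly growing norm of the fitted values, which is the overcorrection described in Remark 2 of Section 2.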



Example continued with K-nearest neighbor smoother. We confirm this behavior
numerically. Using the same data as before, we apply the Boosting algorithm
starting with a pilot K-nearest neighbor smoother with $K = 10$. The pilot
estimator is plotted as a solid line, and the various boosted smoothers with
$k$, the number of iterations, taking values in $\{2, \dots, 5\}$ as dotted
lines.




Fig 2. True function m (thick solid line) and different estimators varying with the number
of iterations k.



For k = 1, the pilot smoother is nearly constant (since we take K = 10 
neighbors) and very quickly the iterative bias corrected estimator explodes. 
Qualitatively, the smoothers are getting higher at the peaks and lower in 
the valleys, which is consistent with an overcorrection of the bias. Contrast 
this behavior with the one shown in Figure 1. 



imsart-aos ver. 2007/04/13 file: paper.tex date: February 5, 2008 






Kernel type smoother. For the Nadaraya-Watson kernel type estimator, the
smoothing matrix $S$ has entries
$S_{ij} = K_h(X_i - X_j) / \sum_k K_h(X_i - X_k)$, where $K(\cdot)$ is a
symmetric function (e.g., uniform, Epanechnikov, Gaussian), $h$ denotes the
bandwidth and $K_h(\cdot)$ is the scaled kernel $K_h(t) = h^{-1} K(t/h)$. The
matrix $S$ is not symmetric but can be written as $S = D\mathbb{K}$, where
$\mathbb{K}$ is symmetric with general element $[K_h(X_i - X_j)]$ and $D$ is
diagonal with elements $1/\sum_j K_h(X_i - X_j)$. Algebraic manipulations
allow us to rewrite the iterated estimator as

         $\hat m_k = [I - (I - S)^k]Y$
         $\phantom{\hat m_k} = [I - (D^{1/2} D^{-1/2} - D^{1/2} D^{1/2} \mathbb{K} D^{1/2} D^{-1/2})^k]Y$
         $\phantom{\hat m_k} = [I - D^{1/2} (I - D^{1/2} \mathbb{K} D^{1/2})^k D^{-1/2}]Y$
         $\phantom{\hat m_k} = D^{1/2} [I - (I - A)^k] D^{-1/2} Y.$

Since the matrix $A = D^{1/2} \mathbb{K} D^{1/2}$ is symmetric, we apply the
classical decomposition $A = P_A \Lambda_A P_A'$, with $P_A$ orthonormal and
$\Lambda_A$ diagonal, to get a closed form expression for the boosted
smoother

         $\hat m_k = D^{1/2} P_A [I - (I - \Lambda_A)^k] P_A' D^{-1/2} Y.$

The eigen decomposition of $A = D^{1/2} \mathbb{K} D^{1/2}$ can be used to
describe the behavior of the sequence of iterative estimators. In particular,
any eigenvalue of $A$ that is negative or greater than 2 will lead to an
unstable procedure. If the kernel $K(\cdot)$ is a symmetric probability
density function that is positive definite, then the spectrum of the
Nadaraya-Watson kernel smoother lies between zero and one.
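The spectrum of the symmetrised matrix $A$ is easy to inspect numerically. The sketch below is an illustration under assumptions: an equally spaced design of 50 points (not the random design of the running example), bandwidths 0.2 and 0.15, and a hypothetical helper `symmetrised_smoother`. It computes the smallest and largest eigenvalues of $A = D^{1/2} \mathbb{K} D^{1/2}$ for a Gaussian and an Epanechnikov kernel.

```python
import numpy as np

def symmetrised_smoother(x, bandwidth, kernel):
    """Return A = D^{1/2} K D^{1/2}, where K[i, j] = kernel((x_i - x_j) / h) / h
    and D is diagonal with entries 1 / sum_j K[i, j]."""
    K = kernel((x[:, None] - x[None, :]) / bandwidth) / bandwidth
    d_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return d_sqrt[:, None] * K * d_sqrt[None, :]

def gaussian(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def epanechnikov(t):
    return 0.75 * np.clip(1.0 - t**2, 0.0, None)

x = np.linspace(0.0, 1.0, 50)           # equally spaced design, for illustration
spectrum = {}
for name, kernel, h in [("gaussian", gaussian, 0.2),
                        ("epanechnikov", epanechnikov, 0.15)]:
    A = symmetrised_smoother(x, h, kernel)
    eigenvalues = np.linalg.eigvalsh(A)  # A is symmetric, so the spectrum is real
    spectrum[name] = (float(eigenvalues.min()), float(eigenvalues.max()))
    print(f"{name:13s} min eigenvalue = {eigenvalues.min(): .4f}, "
          f"max eigenvalue = {eigenvalues.max():.4f}")
```

A negative smallest eigenvalue of $A$ means that $I - A$, and hence $I - S$, has an eigenvalue larger than one, which is the unstable case described above.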

Theorem 3.2. If the inverse Fourier-Stieltjes transform of a kernel
$K(\cdot)$ is a real positive finite measure, then the spectrum of the
Nadaraya-Watson kernel smoother lies between zero and one.

Conversely, suppose that $X_1, \dots, X_n$ are an independent n-sample from a
density $f$ (with respect to Lebesgue measure) that is bounded away from zero
on a compact set strictly included in the support of $f$. If the inverse
Fourier-Stieltjes transform of a kernel $K(\cdot)$ is not a positive finite
measure, then with probability approaching one as the sample size $n$ grows
to infinity, the maximum of the spectrum of $I - S$ is larger than one.

Remark 1: Since $\mathrm{spec}(A)$ is the same as $\mathrm{spec}(S)$ and $S$
is row stochastic, we conclude that $\mathrm{spec}(A) \le 1$. So we are only
concerned with the presence of negative eigenvalues in the spectrum of $A$.

Remark 2: Di Marzio and Taylor [10] proved the first part of the theorem.
Our proof of the converse shows that for large enough sample sizes, most
configurations from a random design lead to a smoothing matrix $S$ with
negative singular values.

Remark 3: The assumption that the inverse Fourier-Stieltjes transform of a
kernel $K(\cdot)$ is a real positive finite measure is equivalent to the
kernel $K(\cdot)$ being a positive definite function, that is, for any finite
set of points $x_1, \dots, x_m$, the matrix

$\begin{pmatrix}
K(0) & K(x_1 - x_2) & K(x_1 - x_3) & \cdots & K(x_1 - x_m) \\
K(x_2 - x_1) & K(0) & K(x_2 - x_3) & \cdots & K(x_2 - x_m) \\
\vdots & & & & \vdots \\
K(x_m - x_1) & K(x_m - x_2) & K(x_m - x_3) & \cdots & K(0)
\end{pmatrix}$

is positive definite. We refer to Schwartz for a detailed study of positive
definite functions.

The Gaussian and triangular kernels are positive definite kernels (they are
the Fourier transform of a finite positive measure (Feller)) and in light of
Theorem 3.2 the Boosting of Nadaraya-Watson kernel smoothers with these
kernels produces a sequence of well-behaved smoothers. However, the uniform
and the Epanechnikov kernels are not positive definite. Theorem 3.2 states
that for large samples, the maximum of the spectrum of $I - S$ is larger than
one, and as a result the sequence of boosted smoothers diverges. Proposition
3.3 below strengthens this result by stating that the largest singular value
of $I - S$ is always larger than one.

Proposition 3.3. Let $S$ be the smoothing matrix of a Nadaraya-Watson
regression smoother based on either the uniform or the Epanechnikov kernel.
Then the largest singular value of $I - S$ is larger than one.

Example continued with Epanechnikov kernel smoother. In the next figure,
the pilot smoother is a kernel smoother with an Epanechnikov kernel and a
bandwidth equal to 0.15. The pilot smoother is the solid line, and the
subsequent iterations with $k$, the number of iterations, taking values in
$\{1, 2, 5, 10, 20, 50, 100\}$, are the dotted lines.
Fig 3. True function m (thick solid line) and different estimators varying with the number
of iterations k.

For $k = 1$, the pilot smoother oversmooths the true regression function,
since the bandwidth covers almost one third of the data, and very quickly the
iterative estimator explodes. Contrast this behavior with the one shown by
the Gaussian kernel smoother in Figure 1.

Finally, let us now consider the smoothing splines smoother. The smoothing
matrix $S$ is symmetric, and therefore admits an eigen decomposition. Denote
by $\{u_1, u_2, \dots, u_n\}$ an orthonormal basis of eigenvectors of $S$
associated to the eigenvalues
$1 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$ (Utreras).
Denote by $P_S = [u_1, u_2, \dots, u_n]$ the orthonormal matrix of column
eigenvectors and write $S = P_S \,\mathrm{diag}(\lambda_j)\, P_S'$, that is,
$S = \sum_j \lambda_j u_j u_j'$. The iterated bias reduction estimator is
given by (2.6). Since all the eigenvalues are between 0 and 1, if $k$ is
large, the iterative procedure kills the eigenvalues less than 1 and puts the
others to 1.

Example continued with smoothing splines smoother. In the next figure, the
pilot smoother is a smoothing spline, with $\lambda$ equal to 0.2. The
different estimators are plotted in Figure 4, with the pilot estimator as a
solid line and the boosted smoothers with number of iterations $k$ being
{10, 50, 100, 500, 10^ , 10^ 10*^} in dotted lines. 
Fig 4. True function m (thick solid line) and different estimators varying with the number
of iterations k.

The pilot estimator is more variable than the pilot estimator of Figure 1
and, as a consequence, both the convergence and the deterioration occur
faster.

4. Smoother engineering. Practical implementations of the Boosting
algorithm include a user selected convergence factor $\mu \in (0, 1)$ that
appears in the definition of the boosted smoother

(4.1)    $\hat m_k = (I - (I - \mu S)^k)Y = B_k Y.$

In this section, we show that when $\mu < 1$, one effectively operates a
partial bias correction. This partial bias correction does not, however,
resolve the problems associated with Boosting a nearest neighbor smoother or
a Nadaraya-Watson kernel smoother with a compact kernel that we exhibited in
the previous section. To resolve these problems, we propose to suitably
modify the boosted smoother. We call such targeted changes smoother
engineering.

The following iterative partial bias reduction scheme is equivalent to the
Boosting algorithm defined by Equation (4.1): Given a smoother
$\hat m_k = B_k Y$ at the $k$-th iteration of the Boosting algorithm,
calculate the residuals and estimated bias

         $R_k = Y - \hat m_k = (I - B_k)Y$
         $\hat b_k = S R_k = S(I - B_k)Y.$

Next, given $0 < \mu < 1$, consider the partially bias corrected smoother

(4.2)    $\hat m_{k+1} = \hat m_k + \mu \hat b_k.$




Algebraic manipulations of the smoothing matrix of the right-hand side of
(4.2) yield

         $B_k + \mu S(I - B_k) = I - (I - \mu S)^{k+1},$

from which we conclude that $\hat m_{k+1}$ satisfies (4.1) and therefore is
the $(k+1)$-th iteration of the Boosting algorithm. It is interesting to
rewrite (4.2) as

         $\hat m_{k+1} = (1 - \mu)\hat m_k + \mu(\hat m_k + \hat b_k),$

which shows that the boosted smoother $\hat m_{k+1}$ is a convex combination
between the smoother at iteration $k$ and the fully bias corrected smoother
$\hat m_k + \hat b_k$. As a result, we understand how the introduction of a
convergence factor produces a "weaker learner" than the one obtained for
$\mu = 1$.
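The algebraic step behind the identity $B_k + \mu S(I - B_k) = I - (I - \mu S)^{k+1}$ can be spelled out as follows. This is a short worked derivation added for illustration; it uses only $B_k = I - (I - \mu S)^k$ from (4.1).

```latex
\begin{align*}
B_k + \mu S (I - B_k)
  &= I - (I - B_k) + \mu S (I - B_k) \\
  &= I - (I - \mu S)(I - B_k) \\
  &= I - (I - \mu S)(I - \mu S)^{k}
     && \text{since } I - B_k = (I - \mu S)^{k} \text{ by (4.1)} \\
  &= I - (I - \mu S)^{k+1}.
\end{align*}
```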

In analogy to Theorem 2.3, the behavior of the sequence of smoothers depends
on the spectrum of $I - \mu S$. Specifically, if
$\max_j |\lambda_j(I - \mu S)| < 1$, then
$\lim_{k \to \infty} \hat m_k = Y$, and conversely, if
$\max_j |\lambda_j(I - \mu S)| > 1$, then
$\lim_{k \to \infty} \|\hat m_k\| = \infty$. Inspection of the proofs of
Theorems 3.1 and 3.2 reveals that the spectrum of $I - \mu S$ for both the
nearest neighbor smoother and the Nadaraya-Watson kernel smoother has
singular values of magnitude larger than one. Hence the introduction of the
convergence factor does not help resolve the difficulties arising when
Boosting these smoothers.

To resolve the potential convergence issues, one needs to suitably modify
the underlying smoother to ensure that the magnitudes of the singular values
of $I - \mu S$ are bounded by one. A practical solution is to replace the
smoothing matrix $S$ by $S^* = SS'$. If $S$ is a contraction, it follows that
the eigenvalues of $I - S^*$ are nonnegative and bounded by one. Hence the
Boosting algorithm with this smoother will produce a well behaved sequence of
smoothers with $\lim_{k \to \infty} \hat m_k = Y$.
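The effect of replacing $S$ by $S^* = SS'$ can be examined numerically. The sketch below is an illustration under assumptions, not the authors' code: it uses a K-nearest neighbor smoother on an equally spaced design with $n = 50$ and $K = 5$ and simulated responses. Rather than assuming that $S$ is a contraction, it computes the spectrum of $I - S^*$ directly and tracks the size of the boosted fit.

```python
import numpy as np

n, K = 50, 5
x = np.linspace(0.0, 1.0, n)            # equally spaced design, for illustration

# K-nearest neighbor smoothing matrix (each point counts among its own neighbors).
distances = np.abs(x[:, None] - x[None, :])
nearest = np.argsort(distances, axis=1)[:, :K]
S = np.zeros((n, n))
S[np.arange(n)[:, None], nearest] = 1.0 / K

S_star = S @ S.T                        # symmetric and positive semi-definite
identity = np.eye(n)
eig_original = np.linalg.eigvals(identity - S)
eig_modified = np.linalg.eigvalsh(identity - S_star)
radius_original = float(np.max(np.abs(eig_original)))
radius_modified = float(np.max(np.abs(eig_modified)))
print(f"max |eigenvalue| of I - S : {radius_original:.3f}")
print(f"max |eigenvalue| of I - S*: {radius_modified:.3f}")

rng = np.random.default_rng(5)          # illustrative data
y = np.sin(5 * np.pi * x) + 0.3 * rng.normal(size=n)
fit = S_star @ y
for _ in range(399):                    # 400 boosting iterations with S*
    fit = fit + S_star @ (y - fit)
norm_ratio = float(np.linalg.norm(fit) / np.linalg.norm(y))
print(f"||fit|| / ||y|| after 400 iterations: {norm_ratio:.3f}")
```

In this sketch the eigenvalues of $I - S^*$ stay within the unit interval in modulus, so the boosted fit remains bounded even after many iterations.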

While substituting the smoother $S^*$ for $S$ can produce better boosted
smoothers in cases where Boosting failed, our numerical experiments have
shown that the performance of Boosting $S^*$ is not as good as that of
Boosting $S$ when the pilot estimator enjoys good properties, as is the case
for smoothing splines and the Nadaraya-Watson kernel smoother with Gaussian
kernel.



5. Stopping rules. Theorem 2.3 in Section 2 states that the limit of the
sequence of boosted smoothers is either the raw data $Y$ or has infinite
norm. It follows that iterating the Boosting algorithm until convergence is
not desirable. However, since each iteration of the Boosting algorithm
reduces the bias and increases the variance, often a few iterations of the
Boosting algorithm will produce a better smoother than the pilot smoother.
This brings up the important question of how to decide when to stop the
iterative bias correction process.

Viewing the latter question as a model selection problem suggests stopping
rules based on Mallows' $C_p$ (Mallows), the Akaike Information Criterion,
AIC (Akaike [3]), the Bayesian Information Criterion, BIC (Schwarz [33]),
and Generalized Cross Validation, GCV (Craven and Wahba). Each of
these selectors estimates the optimum number of iterations $k$ of the Boosting
algorithm by minimizing an estimate of the expected squared prediction error
of the smoothers over some pre-specified set $\mathcal{K} = \{1, 2, \ldots, M\}$.

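Three of the six criteria we study numerically in Section 6 use plug-in
estimates for the squared bias and variance of the expected prediction mean
square error. Specifically, consider

(5.1)  $\hat k_{AIC} = \arg\min_{k \in \mathcal{K}} \left\{ \log \hat\sigma^2 + \frac{2\,\mathrm{trace}(S_k)}{n} \right\},$

(5.2)  $\hat k_{GCV} = \arg\min_{k \in \mathcal{K}} \left\{ \log \hat\sigma^2 - 2 \log\left(1 - \frac{\mathrm{trace}(S_k)}{n}\right) \right\},$

(5.3)  $\hat k_{AICc} = \arg\min_{k \in \mathcal{K}} \left\{ \log \hat\sigma^2 + 1 + \frac{2(\mathrm{trace}(S_k) + 1)}{n - \mathrm{trace}(S_k) - 2} \right\}.$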
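The three plug-in criteria are cheap to evaluate for a linear smoother. The sketch below is one possible implementation under assumptions that are not stated in this section: the $k$th boosted smoothing matrix is taken to be $S_k = I - (I - S)^k$ (so that $S_1 = S$), and $\hat\sigma^2$ is taken to be the residual mean square $\|Y - S_k Y\|^2 / n$. The function names and the example data are hypothetical.

```python
import numpy as np

def gaussian_smoother_matrix(x, h):
    """Row-normalised Gaussian kernel smoothing matrix (example pilot)."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return w / w.sum(axis=1, keepdims=True)

def boosted_hat_matrix(S, k):
    """S_k = I - (I - S)^k, so that S_1 = S (indexing is an assumption)."""
    I = np.eye(S.shape[0])
    return I - np.linalg.matrix_power(I - S, k)

def stopping_criteria(S, y, M):
    """AIC, GCV and AICc values, as in (5.1)-(5.3), for k = 1, ..., M."""
    n = len(y)
    I = np.eye(n)
    power = I.copy()
    out = {"AIC": [], "GCV": [], "AICc": []}
    for _ in range(M):
        power = power @ (I - S)            # (I - S)^k
        resid = power @ y                  # Y - S_k Y
        tr = n - np.trace(power)           # trace(S_k)
        log_s2 = np.log(resid @ resid / n) # assumed plug-in variance estimate
        out["AIC"].append(log_s2 + 2.0 * tr / n)
        out["GCV"].append(log_s2 - 2.0 * np.log(1.0 - tr / n))
        out["AICc"].append(log_s2 + 1.0 + 2.0 * (tr + 1.0) / (n - tr - 2.0))
    return {name: np.array(v) for name, v in out.items()}

def select_k(values):
    """Index (starting at 1) of the minimiser over k = 1, ..., M."""
    return int(np.argmin(values)) + 1

# Illustrative data (assumed).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=50))
y = np.sin(5 * np.pi * x) + 0.4 * rng.standard_normal(50)
S = gaussian_smoother_matrix(x, h=0.15)
crit = stopping_criteria(S, y, M=20)
k_hat = {name: select_k(v) for name, v in crit.items()}
```

The AICc expression is only meaningful while $n - \mathrm{trace}(S_k) - 2 > 0$, so in practice $M$ should be kept small enough for that to hold.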



In nonparametric smoothing, the AIC criterion (5.1) has a noticeable
tendency to select more iterations than needed, leading to a final smoother
that typically undersmooths the data. As a remedy, Hurvich et al.
introduced a corrected version of the AIC (5.3) under the simplifying
assumption that the nonparametric smoother $\hat m$ is unbiased, which rarely
holds in practice and which is particularly not true in our context.

The other three criteria considered in our simulation study in Section 6
are cross-validation, L-fold cross-validation and data splitting, all of which
estimate empirically the expected prediction mean square error by splitting
the data into learning and testing sets. Implementation of these criteria
requires one to evaluate the smoother at locations outside of the design.
To this end, write the $k$th iterated smoother as a $k$ times bias corrected
smoother

$$\hat m_k = \hat m_0 + \hat b_1 + \cdots + \hat b_k
          = S[I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1}]Y,$$

which we rewrite as

$$\hat m_k = S \hat\beta_k,$$

where

$$\hat\beta_k = [I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1}]Y
             = Y + R_1 + R_2 + \cdots + R_k$$

is a vector of size $n$. Given the vector $S(x)$ of size $n$ whose entries are the
weights for predicting $m(x)$, we calculate

(5.4)  $\hat m_k(x) = S(x)' \hat\beta_k.$

This formulation is computationally advantageous because the vector of
weights $S(x)$ only needs to be computed once, while each Boosting iteration
updates the parameter vector $\hat\beta_k$ by adding the residuals $R_k = Y - \hat m_k$
of the fit to the previous value of the parameter, i.e., $\hat\beta_k = \hat\beta_{k-1} + R_k$. The
vector $S(x)$ is readily computed for many of the smoothers used in practice.
For kernel smoothers, the $i$th entry in the vector $S(x)$ is

$$S_i(x) = \frac{K_h(x - X_i)}{\sum_j K_h(x - X_j)}.$$

For smoothing splines, let $N(x)$ denote the vector of basis functions evaluated
at $x$. One can show that $\hat m_k(x) = N(x) M \hat\beta_k$, where $M$ is the $n \times n$ matrix
given by

$$M = (N'N + \lambda\Omega)^{-1} N'.$$

Finally, for the $K$-nn smoother, the entries of the vector $S(x)$ are

$$S_i(x) = \begin{cases} 1/K & \text{if } X_i \text{ is a } K\text{-nn of } x, \\ 0 & \text{otherwise.} \end{cases}$$



We note that if the spectrum of $I - S$ is bounded in absolute value by one,
then the parameter $\hat\beta_k \to \hat\beta_\infty$, and hence we have pointwise convergence of
$\hat m_k(x)$ to some $\hat m_\infty(x)$, whose properties depend on $S(x)$.
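The out-of-design prediction scheme of (5.4) can be sketched as follows for a kernel smoother. This is an illustration under assumptions: a Gaussian kernel, the pilot indexed as iteration 1 with $\hat\beta_1 = Y$, and each update adding the residuals of the current fit, which may be offset by one from the indexing used in the displays above. The helper names are hypothetical.

```python
import numpy as np

def kernel_weights(x_new, x, h):
    """Each row holds the weight vector S(x)' for one prediction point x."""
    d = (np.atleast_1d(x_new)[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return w / w.sum(axis=1, keepdims=True)

def boosted_beta(S, y, k):
    """beta_1 = y; beta_j = beta_{j-1} + (y - S beta_{j-1}) for j = 2, ..., k."""
    beta = y.copy()
    for _ in range(k - 1):
        beta = beta + (y - S @ beta)   # add residuals of the current fit
    return beta

def predict(x_new, x, y, h, k):
    """Boosted prediction at arbitrary locations: weights times beta_k."""
    S = kernel_weights(x, x, h)        # smoothing matrix on the design
    beta = boosted_beta(S, y, k)
    return kernel_weights(x_new, x, h) @ beta

# Illustrative data (assumed).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=30))
y = np.sin(5 * np.pi * x) + 0.3 * rng.standard_normal(30)
x_new = np.array([0.05, 0.5, 0.95])
fit_new = predict(x_new, x, y, h=0.1, k=5)
fit_design = predict(x, x, y, h=0.1, k=5)
```

At the design points this reproduces $[I - (I - S)^k]Y$, while the weights at new locations are computed only once whatever the number of iterations.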

To define the data splitting and cross validation stopping rules, one
divides the sample into two disjoint subsets: a learning set $\mathcal{L}$ which is used to
estimate the smoother $\hat m_k$, and a testing set $\mathcal{T}$ on which predictions from
the smoother are compared to the observations. The data splitting selector
for the number of iterations is

(5.5)  $\hat k_{DS} = \arg\min_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}} \left(Y_i - \hat m_k(X_i)\right)^2.$

One-fold cross validation, or simply cross validation, and more generally
L-fold cross validation average the prediction error over all partitions of the
data into learning and testing sets, with fixed size of the testing set
$|\mathcal{T}| = L$. This leads to

(5.6)  $\hat k_{CV} = \arg\min_{k \in \mathcal{K}} \sum_{|\mathcal{T}| = L} \sum_{i \in \mathcal{T}} \left(Y_i - \hat m_k(X_i)\right)^2.$
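A data splitting selector in the spirit of (5.5) can be sketched as below. The choices made here are assumptions for illustration only: a Gaussian kernel pilot, a test set formed by every third observation, and the pilot counted as iteration 1. The function names are hypothetical.

```python
import numpy as np

def kernel_weights(x_new, x, h):
    """Gaussian kernel weights; one row per prediction point."""
    d = (np.atleast_1d(x_new)[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return w / w.sum(axis=1, keepdims=True)

def data_splitting_k(x, y, h, M, test_idx):
    """Return the k in {1, ..., M} minimising the test-set squared error."""
    test = np.zeros(len(x), dtype=bool)
    test[test_idx] = True
    x_learn, y_learn = x[~test], y[~test]
    x_test, y_test = x[test], y[test]
    S = kernel_weights(x_learn, x_learn, h)   # smoother on the learning set
    W = kernel_weights(x_test, x_learn, h)    # weights at the test locations
    beta = y_learn.copy()
    errors = []
    for k in range(1, M + 1):
        if k > 1:
            beta = beta + (y_learn - S @ beta)    # one more bias correction
        errors.append(np.sum((y_test - W @ beta) ** 2))
    errors = np.array(errors)
    return int(np.argmin(errors)) + 1, errors

# Illustrative data (assumed).
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(size=60))
y = np.sin(5 * np.pi * x) + 0.4 * rng.standard_normal(60)
test_idx = np.arange(0, 60, 3)
k_ds, test_errors = data_splitting_k(x, y, h=0.15, M=25, test_idx=test_idx)
```

Averaging `test_errors` over several partitions with test sets of the same size gives the cross validation variant (5.6).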



We rely on the extensive literature on model selection to provide insight
into the statistical properties of the stopped boosted smoother. For example,
Theorem 3.2 of Li [27] describes the asymptotic behavior of the generalized
cross-validation (GCV) stopping rule applied to spline smoothers.

Theorem 5.1 (Li, 1987). Assume that Li's assumptions are verified for
the smoother $S$. Then

$$\frac{\|m - S_{\hat k_{GCV}} Y\|^2}{\inf_{k \in \mathcal{K}} \|m - S_k Y\|^2} \to 1 \quad \text{in probability.}$$



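Results on the finite sample performance of data splitting for arbitrary
smoothers are presented in Theorem 1 of Hengartner et al. [21], who proved
the following oracle inequality.

Theorem 5.2. For each $k$ in $\mathcal{K}$, $\lambda > 0$ and $\alpha > 0$, we have

$$P\left\{ \frac{1}{m} \sum_{i=n+1}^{n+m} (\hat m_{\hat k_{DS}} - m)^2(X_i)
   - \frac{1+\alpha}{m} \sum_{i=n+1}^{n+m} (\hat m_k - m)^2(X_i) > \lambda \right\}
   \le |\mathcal{K}| \sqrt{\frac{32(1+\alpha)\sigma^2}{\pi \alpha m \lambda}}
   \exp\left( -\frac{\alpha m \lambda}{8(1+\alpha)\sigma^2} \right).$$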


Example continued with smoothing splines 

Figure 5 shows the three pilot smoothers (smoothing splines with different
smoothing parameters) considered in the simulation study in Section 6.

Fig 5. True function $m_1$ (plain line) and different pilot smoothing spline smoothers, $S(\lambda_1)$
(dotted line), $S(\lambda_2)$ (dashed line), $S(\lambda_3)$ (dash-dotted line) for the 50 data points of one
replication (Gaussian error).



Starting with the smoothest pilot smoother $S(\lambda_1)$, the Generalized Cross
Validation criterion stops after 1389 iterations. Starting with smoother $S(\lambda_2)$,
GCV stopped after 23 iterations, while starting with the noisiest pilot $S(\lambda_3)$,
GCV stopped after one iteration. It is remarkable how similar the final
smoothers are.




Fig 6. True function $m_1$ (plain line) and different pilot smoothing spline smoothers, $S(\lambda_1)$
(dashed line), $S(\lambda_2)$ (dotted line), $S(\lambda_3)$ (dash-dotted line) for the same 50 data points as
in Figure 5 of one replication (Gaussian error).



The final selected estimators are very close to one another, despite the
different pilot smoothers and the different numbers of iterations that were
selected by the GCV criterion. Despite the flatness of the pilot smoother
$S(\lambda_1)$, it succeeds after 1389 iterations in capturing the signal. Note that
larger smoothing parameters $\lambda$ are associated with weaker learners that
require a larger number of bias correction iterations before they become
desirable smoothers according to the generalized cross validation criterion.
A close examination of Figure 6 shows that using the least biased pilot
$S(\lambda_3)$ leads to the worst final estimator. This can be explained as follows:
if the pilot smoother is not biased enough, almost no signal is left in the
residuals after the first step, and the iterative bias reduction is stopped.

We remark again that one does not need to keep the same smoother 
throughout the iterative bias correcting scheme. We conjecture that there 
are advantages to using weaker smoothers later in the iterative scheme, and 
shall investigate this in a forthcoming paper. 

6. Simulations. This section presents selected results from our simulation
study that investigates the statistical properties of the various data
driven stopping rules. The simulations examine, within the framework set
by Hurvich et al., the impact on performance of various stopping rules,
smoother type, smoothness of the pilot smoother, sample size, true regression
function, and the relative variance of the errors as measured by the
signal to noise ratio.

We examine the influence of various factors on the performance of the
selectors, with 100 simulation replications and a random uniform grid in
$[0, 1]$. The error standard deviation is $\sigma = 0.2 R_g$, where $R_g$ is the range of
$g(x)$ over $x \in [0, 1]$. For each setting of factors, we have

(A) sample size: $n = 50$, 100 and 500

(B) the following 3 regression functions, most of which were used in earlier
studies

1. $m(x) = \sin(5\pi x)$,

2. $m(x) = 1 - 48x + 218x^2 - 315x^3 + 145x^4$,

3. $m(x) = \exp(x - \tfrac{1}{3})\{x < \tfrac{1}{3}\} + \exp[-2(x - \tfrac{1}{3})]\{x \ge \tfrac{1}{3}\}$.

(C) error distribution: Gaussian and Student $t(5)$

(D) pilot smoothers: smoothing splines, Gaussian kernel, $K$-nearest neighbor
type smoother

(E) three starting smoothers: $S_1$, $S_2$ and $S_3$ by decreasing order of smoothing.

For each setting, we compute the ideal number of iterations by computing
at the data points $\{X_i\}_{i=1}^n$

$$k_{opt} = \arg\min_{k \in \mathcal{K}} \sum_{i=1}^{n} |m(X_i) - \hat m_k(X_i)|^2.$$
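One replication of this design, together with the oracle number of iterations, might be simulated as in the sketch below. Several choices are assumptions rather than facts from the paper: the breakpoint $1/3$ in the third function, the rescaling of the Student errors to unit variance, the evaluation of the range $R_g$ on a fine grid, and the indexing $\hat m_k = [I - (I - S)^k]Y$. All helper names are hypothetical.

```python
import numpy as np

def m1(x):
    return np.sin(5 * np.pi * x)

def m2(x):
    return 1 - 48 * x + 218 * x**2 - 315 * x**3 + 145 * x**4

def m3(x):
    # Breakpoint 1/3 is an assumption (the fractions are unclear in the source).
    return np.where(x < 1 / 3, np.exp(x - 1 / 3), np.exp(-2 * (x - 1 / 3)))

def simulate(m, n, rng, error="gaussian"):
    """Random uniform design on [0, 1], noise sd equal to 0.2 * range of m."""
    x = np.sort(rng.uniform(size=n))
    values = m(np.linspace(0.0, 1.0, 1001))
    sigma = 0.2 * (values.max() - values.min())
    if error == "gaussian":
        eps = rng.standard_normal(n)
    else:
        # Student t(5), rescaled to unit variance (assumption).
        eps = rng.standard_t(5, size=n) / np.sqrt(5.0 / 3.0)
    return x, m(x) + sigma * eps, sigma

def oracle_k(S, y, m_true, M):
    """k in {1, ..., M} minimising sum_i |m(X_i) - m_k(X_i)|^2."""
    I = np.eye(len(y))
    power = I.copy()
    losses = []
    for _ in range(M):
        power = power @ (I - S)
        fit = y - power @ y              # [I - (I - S)^k] y
        losses.append(np.sum((m_true - fit) ** 2))
    return int(np.argmin(losses)) + 1

rng = np.random.default_rng(4)
x, y, sigma = simulate(m1, 50, rng)
d = (x[:, None] - x[None, :]) / 0.15
W = np.exp(-0.5 * d**2)
S = W / W.sum(axis=1, keepdims=True)     # Gaussian kernel pilot (example)
k_opt = oracle_k(S, y, m1(x), M=50)
```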
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Since the results are numerous we report here a summary, focusing on the 
main objectives of the paper. 

First of all, do the stopping procedures proposed in Section 5 work?
Figure 7 represents the kernel density estimates of the log ratios of the
selected number of iterations to the ideal number of iterations for the
smoothing spline type smoother.




Fig 7. Estimated density of $\log(\hat k / k_{opt})$, with $\hat k$ evaluated by different stopping criteria:
GCV, CV (leave one out), CV 5 fold (CV5), data splitting (DS), AIC and corrected AIC
(AICc). The density is estimated on 100 replications for function $m_1$, with Gaussian error,
spline smoother $S_2$ and $n = 50$ data points.



Negative values indicate that too few iterations were selected ($\hat k$ smaller
than $k_{opt}$, that is, not enough bias reduction), while positive values indicate
that too many iterations were selected. The results remain essentially
unchanged over the range of starting values, regression functions and
smoother types we considered in our simulation study.

For small data sets ($n = 50$), the stopping rule based on data splitting
produced values for $\hat k$ that were very variable. A similar observation about
the variability of bandwidth selection from data splitting was made in
[21]. We also found that the five fold cross validation stopping rule produced
highly variable values for $\hat k$.

The AIC stopping rule selects values $\hat k$ that are often too big and
sometimes selects the largest possible value of $k \in \mathcal{K}$. In such cases,
the curve of AIC versus $k$ (not shown) exhibits two minima, a local one
which is around the true value and the global one at the boundary. This can
be attributed to the fact that the penalty term used by AIC is too small.
The AICc criterion uses a larger penalty term, which leads to smaller values
for $\hat k$. In fact, the selected values are typically smaller than the optimal one.
The penalty associated with GCV lies in between the AICc penalty and the AIC
penalty, and produces in practice values of $\hat k$ that are closer to the optimum
than either AIC or AICc. Finally, the leave one out cross-validation selection
rule produces $\hat k$ that are typically larger than the optimal one.

Investigation of the MSE as a function of the number of iterations $k$
reveals that, in the examples we considered, that function decreases rapidly
towards its minimum and then remains relatively flat over a range of values
to the right of the minimum. It follows that the loss incurred by stopping
after $k_{opt}$ is smaller than the loss incurred by stopping before $k_{opt}$. We verify
this empirically as follows: for each estimate, we calculate the approximation
to the integrated mean squared error between the estimator and the true
regression function $m$

$$MSE(\hat m_{\hat k}) = \frac{1}{100} \sum_{x \in \mathcal{G}} \left( m(x) - \hat m_{\hat k}(x) \right)^2,$$

where $\mathcal{G}$ is a fixed grid of 100 regularly spaced points in the unit interval $[0, 1]$.
We partition the calculated integrated mean squared errors depending on
whether $\hat k$ is bigger than $k_{opt}$ or smaller than $k_{opt}$. Figure 8 presents the
boxplots of the integrated mean squared error when $\hat k$ over-estimates $k_{opt}$
and when it under-estimates $k_{opt}$, and clearly shows that over-estimating
$k_{opt}$ leads to a smaller integrated mean squared error than under-estimating
$k_{opt}$.



Fig 8. Boxplots of $MSE(\hat m_{\hat k_{GCV}})$ when $\hat k_{GCV} > k_{opt}$ (denoted as GCV+) and
when $\hat k_{GCV} < k_{opt}$ (denoted as GCV-), and the same boxplots with the leave one
out stopping criterion (CV+ and CV-). Mean squared errors are estimated on a grid of
100 points regularly spaced between 0 and 1, over 100 replications for function $m_1$, with
Gaussian error, spline smoother $S_2$ and $n = 50$ data points.

For bigger data sets, say $n = 100$ or bigger, most of the stopping criteria
behave the same, except for the modified AIC, which tends to select a
smaller number of iterations $k$ than the ideal one. One-fold cross-validation
is rather computationally intensive, as the usual relation between the cross
validated estimator at $X_i$ and the full data estimator is no longer valid
[e.g. 19, p. 47].

These conclusions remain true for kernel smoothers and nearest neighbor
smoothers. However, if the pilot smoother is not smooth enough (not biased
enough), then the number of iterations is too small to allow us to discriminate
between the different stopping rules. These initial smoothers, which we call
wiggly learners, are almost unbiased, and therefore there is little value in
applying an iterative bias correction scheme. In conclusion, for small data sets,
our simulations show that both GCV and leave one out cross-validation work
well, and for bigger data sets, we recommend using GCV.

Tables 1 and 2 below report the finite sample performance of the
boosted smoother stopped by the GCV criterion. Each entry in the tables
reports the median number of iterations and the median mean squared error
over 100 simulations. As expected, a larger smoothing parameter of the
initial smoother requires more iterations of the boosting algorithm to reach its
optimum. Interestingly, the selected smoother starting with a very smooth
smoother has a slightly smaller mean squared error. To quantify the
benefits of the iterative bias correction scheme, the last column of the tables
gives the mean squared error of the original smoother with smoothing
parameters selected using GCV. In all cases, the iterative bias correction has
smaller mean squared error than the "one-step" smoother, with improvements
ranging from 15% to 30%.

Table 1 presents the results for smoothing splines.

                    S1                S2                S3         S(lambda_GCV)
error           k      MSE        k      MSE        k      MSE         MSE

Function m1
Gaussian     4077   0.0273       65   0.0282        2   0.0293      0.0379
Student      4115   0.0273       70   0.0286        2   0.0296      0.0352

Function m2
Gaussian     1219   0.0798       21   0.0845        1   0.0837      0.0829
Student      1307   0.0887       22   0.0944        1   0.0932      0.0937

Function m3
Gaussian      135   0.0014        3   0.0014        1   0.0016      0.0016
Student       147   0.0016        3   0.0016        1   0.0018      0.0019

Table 1. Median over 100 simulations of the number of iterations k and the MSE for the
smoothing splines smoother, n = 50 data points. S(lambda_GCV) denotes the traditional
smoothing splines estimate with lambda chosen with GCV.

Table 2 presents the results for kernel smoothers with a Gaussian kernel.



                    S1                S2                S3          S(h_AICc)
error           k      MSE        k      MSE        k      MSE         MSE

Function m1
Gaussian      385   0.0231       27   0.0254        4   0.0368      0.04857
Student       360   0.0221       25   0.0262        4   0.0353      0.05199

Function m2
Gaussian      330   0.0477      128   0.0581       14   0.0782      0.1175
Student      1621   0.0563      160   0.0660       16   0.0754      0.1184

Function m3
Gaussian       30   0.0017        7   0.0016        2   0.0014      0.00178
Student        29   0.0017        8   0.0016        2   0.0016      0.0018

Table 2. Median over 100 simulations of the number of iterations k and the MSE for the
Gaussian kernel smoother, n = 50 data points. S(h_AICc) denotes the kernel smoother with
bandwidth chosen by the modified AIC criterion.



The simulation results reported in the above tables show that the iterative 
bias reduction scheme works well in practice, even for moderate sample sizes. 
While starting with a very smooth pilot requires more iterations, the mean 
squared error of the resulting smoother is somewhat smaller compared to a 
more noisy initial smoother. Figures 5 and 6 also support this claim.



7. Discussion. In this paper, we make the connection between iterative
bias correction and the L2 boosting algorithm, thereby providing a new
interpretation for the latter. A link between bias reduction and boosting was
suggested by [30] in his discussion of the seminal paper [17], and explored in
Di Marzio and Taylor for the special case of kernel smoothers. In this
paper, we show that this interpretation holds for general linear smoothers.

It was surprising to us that not all smoothers were suitable to be used for
boosting. We show that many weak learners, such as the $K$-nearest neighbor
smoother and some kernel smoothers, are not stable under L2 boosting. Our
results extend and complement the recent results of Di Marzio and Taylor.

Iterating the boosting algorithm until convergence is not desirable. Better
smoothers result if one stops the iterative scheme. We have explored, via
simulations, various data driven stopping rules and have found that for
linear smoothers, the Generalized Cross Validation criterion works very well,
even for moderate sample sizes of 50. Our simulations show that optimally
correcting the bias (by stopping the L2 boosting algorithm after a suitable
number of iterations) produces better smoothers than the one with the best
data-dependent smoothing parameter.

Finally, the iterative bias correction scheme can be readily extended to
multivariate covariates $X$, as in Buhlmann.

APPENDIX A: APPENDIX 

Proof of Theorem [Ml. To show [ESl, let $\tilde S = I + (I - S) + \cdots + (I - S)^{k-1}$.
The conclusion follows from a telescoping sum argument applied to

$$S \tilde S = \tilde S - (I - S)\tilde S = I - (I - S)^k.$$

Proof of Theorem 2.3.

$$\|\hat b_k\|^2 = \|-(I - S)^{k-1} S Y\|^2
               = \|(I - S)(I - S)^{k-2} S Y\|^2 \le \|I - S\|^2 \|\hat b_{k-1}\|^2,$$

where the last inequality follows from the assumptions on the spectrum of
$I - S$. Similarly, one shows that

$$\|R_k\|^2 = \|(I - S)^k Y\|^2 \le \|I - S\|^2 \|R_{k-1}\|^2 \le \|R_{k-1}\|^2.$$

Proof of Theorem 3.1. To simplify the exposition, let us assume that
the $X_j$'s are ordered. Let us consider the $K$-nn smoother, whose matrix $S$
has general term

$$S_{ij} = \frac{1}{K} \quad \text{if } X_j \in K\text{-nn}(X_i).$$

In order to bound the singular values of $I - S$, consider the eigenvalues
of $(I - S)(I - S)'$, which are the squares of the singular values of $I - S$. Since
$A = (I - S)(I - S)'$ is symmetric, we have for any vector $x$ that

(A.1)  $\lambda_n \le \frac{x'Ax}{x'x} \le \lambda_1.$

Let us find a vector x such that x'Ax > x'x. First notice that 

Next, consider the vector $x$ of $\mathbb{R}^n$ that is zero everywhere except at position
$(i - l_1)$ (respectively $i$ and $i + l_2$) where its value is $-1$ (respectively 2 and
$-1$). For this choice, we expand $x'Ax$ to get

$$x'Ax = A_{i-l_1,\,i-l_1} + 4A_{i,i} + A_{i+l_2,\,i+l_2} - 4A_{i-l_1,\,i} - 4A_{i,\,i+l_2} + 2A_{i+l_2,\,i-l_1}.$$
6 
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To show that this last quantity is larger than x^x = 6, we need to suitably 
bound the off-diagonal elements oiA = I — S — S' + SS'. To bound Aij, 
where j = i + I and I < K, we need to consider three cases: 

1. If Xj belongs to the i^'-nn of Xj and vice versa, then Sij = Sj^ = 1/K. 
This does not mean that all the K-nn neighbor of Xi are the same as 
those of Xj, but if it is the case, then {SS')ij < K/K'^ and otherwise 
in the pessimistic case, we bound {SS')ij > (/ + 1)/K'^. It therefore 
follows that 

N , 9 2 , K 2 1 

(i+l)/A=--< <^-- = --. 

2. If Xi belongs to the K-nn of Xj Sij = 1/K but Xj does not belong 
to the K-nn of Xi then S'ji = 0. There is at a maximum of K — 1 
points that are in the J^-nn of Xi and in the K-nn of Xj so {SS')ij < 
(K — 1)/K'^. In the pessimistic case, there is only one point, which 
leads to the bound 

11 i^- 1 1 1 

3. If Xi does not belong to the K-nn of Xj Sij = and Xj does not 
belong to the K-nn of Xi then S'j^ = 0. However there are potentially 
as many as Z — 1 points that are in the K-nn of Xj and in the K-nn of 
Xj, so that {SS')ij <{l- l)/K'^. In that case 

l-l K-2 
< Aij < < 



With these bounds for the off-diagonal terms, we are able to major x' Ax. 

Before continuing, we need to discuss the relative position of the points 
Xi_i^ , Xi and Xij^i^ . We chose them such that 

Xi_i^ G K-nn{Xi) and X^ G K-nn{Xi_i^). 

For this choice, we calculate 

11 + l-2K 1 

12 + I-2K 1 

^2 - "^hi+h - 



so that 



6 8 2 

— + — + 2Ai+^_i < x'Ax < 6 + — + 2Ai+i. 
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The latter shows that x'Ax > x'x whenever 

1 

which is always true if the condition 

Xi_i^ ^ K-iin{Xi+i^) or Xi+i^ K-nn{Xi_i^) 

is satisfied because in such case, we have 

1 1 

Proof of Theorem 3.2. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a density
$f$ that is bounded away from zero on a compact set strictly included in the
support of $f$. Consider without loss of generality that $f(x) \ge c > 0$ for all
$|x| < b$.

We are interested in the sign of the quadratic form $u'Au$, where the
individual entries $A_{ij}$ of the matrix $A$ are equal to

$$A_{ij} = \frac{K_h(X_i - X_j)}{\sqrt{\sum_l K_h(X_i - X_l)}\,\sqrt{\sum_l K_h(X_j - X_l)}}.$$

Recall the definition of the scaled kernel $K_h(\cdot) = K(\cdot/h)/h$. If $v$ is the vector
of coordinates $v_i = u_i / \sqrt{\sum_l K_h(X_i - X_l)}$, then we have $u'Au = v'\mathbb{K}v$, where
$\mathbb{K}$ is the matrix with individual entries $K_h(X_i - X_j)$. Thus any conclusion
on the quadratic form $v'\mathbb{K}v$ carries over to the quadratic form $u'Au$.

To show the existence of a negative eigenvalue for $\mathbb{K}$, we seek to construct a
vector $U = (U_1(X_1), \ldots, U_n(X_n))$ for which we can show that the quadratic
form

$$U'\mathbb{K}U = \sum_{j=1}^{n} \sum_{k=1}^{n} U_j(X_j) U_k(X_k) K_h(X_j - X_k)$$

converges in probability to a negative quantity as the sample size grows to
infinity. We show the latter by evaluating the expectation of the quadratic
form and applying the weak law of large numbers.

Let f{x) be a real function in L2, define its Fourier transform 

and its Fourier inverse by 
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For kernels K(-) that are real symmetric probability densities, we have 

k{t) = kinvit). 

From Bochner's theorem, we know that if the kernel K{-) is not positive 
definite, then there exists a bounded symmetric set A of positive Lebesgue 
measure (denoted by |^|), such that 



(A.2) 



k{t) < yt£A. 



Let (p{t) G L2 be a real symmetric function supported on A, bounded by B 
(i.e. \^{t)\ < B). Obviously, its inverse Fourier transform 



ip{x) 



'2nixt 



(p{t)dt 



is integrable and by virtue of Parceval's identity 

||(^||2 = ||^||2 < ^2j^| < 



Using the following version of Parceval's identity [seell4l. p. 620] 



00 J —00 



00 roo 



v{xMy)K{x-y)dxdy= / mt)\'K{t)dt, 

J —00 

which when combined with equation (|A.2|) . leads us to conclude that 

ip{x)ip{y)K{x — y)dxdy < 0. 
Consider the following vector 



nn 



^£^l{\X^\<b) 



With this choice, the expected value of the quadratic form is 



E[Q] 



E 

1 

n 



^ Uj{Xj)Uk{Xk)Kh{Xj - Xk) 
j,k=i 

^ ^{s/hfKh{Q)ds 



f{s)h^ 



+ - 



n — n 



ip{s/h)ip{t/h)Kh{s - t)dsdt 



h+h. 
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We bound the first integral 



n/i2 J_i, f(s) 



< ; — / (/9(m) du 

nch J^b/h 

- ch^ 

Observe that for any fixed value h, the latter can be made arbitrarily small 
by choosing n large enough. We evaluate the second integral by noting that 

h = (^1 - /i"^ ^{s/h)ip{t/h)Kh{s - t)dsdt 

/ 1\ 1 rb/h rb/h 

(A.3) = 1 - - / (p(u)ip(v)K(u - v)dudv. 

\ n) J^b/hJ-b/h 



By virtue of the dominated convergence theorem, the value of the last
integral converges to $\int_{-\infty}^{\infty} |\hat\varphi(t)|^2 \hat K(t)\,dt < 0$ as $h$ goes to zero. Thus for
$h$ small enough, (A.3) is less than zero, and it follows that we can make
$E[Q] < 0$ by taking $n > n_0$, for some large $n_0$. Finally, convergence in
probability of the quadratic form to its expectation is guaranteed by the
weak law of large numbers for U-statistics (see Grams and Serfling [18] for
example). The conclusion of the theorem follows.

Proof of Proposition 3.3. We are interested in the sign of the quadratic
form $u'\mathbb{K}u$ (see the proof of Theorem 3.2). Recall that if $\mathbb{K}$ is positive
semidefinite then all its principal minors [see 23, p. 398] are nonnegative. In
particular, we can show that $A$ is not positive semidefinite by producing a
$3 \times 3$ principal minor with negative determinant. To this end, take the
principal minor $\mathbb{K}[3]$ obtained by taking the rows and columns $(i_1, i_2, i_3)$.
Without loss of generality, let us assume that $X_{i_1} < X_{i_2} < X_{i_3}$. The
determinant of $\mathbb{K}[3]$ is

$$\begin{aligned}
\det(\mathbb{K}[3]) ={}& K_h(0)\left[K_h(0)^2 - K_h(X_{i_3} - X_{i_2})^2\right] \\
 &- K_h(X_{i_2} - X_{i_1})\left[K_h(0)K_h(X_{i_2} - X_{i_1}) - K_h(X_{i_3} - X_{i_2})K_h(X_{i_3} - X_{i_1})\right] \\
 &+ K_h(X_{i_3} - X_{i_1})\left[K_h(X_{i_2} - X_{i_1})K_h(X_{i_3} - X_{i_2}) - K_h(0)K_h(X_{i_3} - X_{i_1})\right].
\end{aligned}$$

Let us evaluate this quantity for the uniform and Epanechnikov kernels.

Uniform kernel. Let $h$ be larger than the minimum distance between three
consecutive points, and choose the indices $i_1, i_2, i_3$ such that

$$X_{i_2} - X_{i_1} < h, \quad X_{i_3} - X_{i_2} < h, \quad \text{and} \quad X_{i_3} - X_{i_1} > h.$$

With this choice, we readily calculate

$$\det(\mathbb{K}[3]) = 0 - K_h(0)\left[K_h(0)^2 - 0\right] + 0 < 0.$$

Since a principal minor of $\mathbb{K}$ is negative, we conclude that $\mathbb{K}$ and $A$ are not
positive semidefinite.



Epanechnikov kernel. For $i_1, i_2, i_3$ fixed, denote by $x = X_{i_2} - X_{i_1}$ and by
$y = X_{i_3} - X_{i_2}$, and assume that $h > \min(x, y)$. The determinant $\det(\mathbb{K}[3])$ is
a bivariate function of $x$ and $y$ (as $X_{i_3} - X_{i_1} = x + y$). Numerical evaluations
of that function show that as soon as the range of the three points is
less than the bandwidth, the determinant of $\mathbb{K}[3]$ is negative.

Fig 9. Contour of $\det(\mathbb{K}[3])$ as a function of $(x, y)$.

Thus a principal minor of $\mathbb{K}$ is negative, and as a result, $\mathbb{K}$ and $A$ are not
positive semidefinite.
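The sign of the $3 \times 3$ principal minor is easy to evaluate numerically. The sketch below is an illustration only: the three points, the bandwidth $h = 1$, and the helper names are assumptions chosen so that the uniform case satisfies the spacing conditions of the proof. A Gaussian kernel is included for contrast.

```python
import numpy as np

def uniform(u):
    return 0.5 * (np.abs(u) <= 1)

def epanechnikov(u):
    return 0.75 * np.clip(1 - u**2, 0, None)

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def det_minor(kernel, points, h):
    """Determinant of the matrix with entries K_h(X_i - X_j) for the given points."""
    pts = np.asarray(points, dtype=float)
    K = kernel((pts[:, None] - pts[None, :]) / h) / h
    return np.linalg.det(K)

# Uniform kernel: consecutive gaps below h, total range above h.
d_unif = det_minor(uniform, [0.0, 0.6, 1.2], h=1.0)
# Epanechnikov kernel: range of the three points below the bandwidth.
d_epan = det_minor(epanechnikov, [0.0, 0.4, 0.8], h=1.0)
# Gaussian kernel: positive definite, so the minor is positive.
d_gauss = det_minor(gaussian, [0.0, 0.4, 0.8], h=1.0)
```

For the uniform case the value is $-K_h(0)^3 = -0.125$, matching the closed form above; the Epanechnikov minor is also negative for these points, while the Gaussian one is positive.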
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